Express Mail Certificate No. TB 293 997 737US 

Attorney Docket No. 9675-004 



Method and System for the Analysis of Variance of Microarray Data 

Field of Invention 

This invention relates to an efficient method for the analysis of
high-dimensionality datasets. Specifically, the invention relates to the analysis of
variance of DNA microarray data, for example, from high-throughput gene 
expression experiments. 

Background of the Invention 
Citation or identification of any reference in this or any of the sections of this
application shall not be construed to mean that it is available as prior art to the
present invention.

DNA microarray experiments are powerful and cost-effective ways for
determining gene expression with many applications. Such experiments have been
used to study gene expression in yeast under different stress conditions, gene
expression profiles for tumors from cancer patients, and gene expression in the
livers of mice representing a model of maturity-onset type II diabetes wherein one
group of mice is fed a beta-3 adrenergic receptor agonist. In addition to helping
scientists understand gene regulation and interactions, microarray experiments may
be used to identify disease genes and targets for therapeutic drugs.

In a typical DNA microarray experiment, a microarray is prepared by fixing or 
synthesizing known polynucleotides to a suitable substrate in a grid pattern. Each 
spot in the microarray is comprised of a purified polynucleotide and each 
polynucleotide may be placed in several spots on the microarray. Each spot is 
referred to herein as a "probe" or "gene." As used herein, "probe" and "target" follow 
the definitions adopted in The Chipping Forecast, volume 21, Supplement to Nature
Genetics, 1999. The microarray may have thousands of polynucleotide spots
contained in an area of about 2 cm on each side. The microarrays are usually
produced by an automated mechanical printing process so that the same
polynucleotide is spotted on the same location in each array. Alternatively, the
microarray may be produced by synthesizing the polynucleotides directly on the
surface of the substrate.

Pools of purified mRNA are prepared from cell populations under study and 
reverse-transcribed into cDNA. The cDNA samples are labeled with fluorescent 
dyes such as the red fluorescent dye, Cy5, and the green fluorescent dye, Cy3.
Other dyes may be used to tag the cDNA targets, but the tagged targets are usually
referred to by the Cy5/Cy3 colors, "red" and "green." The labeled cDNA samples are
referred to herein as the "target" or "variety." The purpose of a typical cDNA
microarray experiment is to determine the effect, if any, between the genes on the
microarray and the labeled cDNA varieties.
In one type of cDNA microarray experiment, two varieties are used: one taken
from a diseased cell line and one taken from a healthy cell line. The goal of such an
experiment is to identify significant differences between the expression of particular
genes in the healthy and diseased states.

In another type of cDNA microarray experiment, the number of varieties
examined may be as high as a hundred or more, wherein samples from a single cell
line undergoing a process being investigated are taken at various times during the
process. Each sample taken at a particular time represents a variety and the
number of varieties in the experiment equals the number of samples taken.
One-half of each variety is labeled with the red dye and the other half is
labeled with the green dye. If the experiment involves only two varieties, the red half
of the first variety is mixed with the green half of the second variety to form a first 
test sample. In a preferred embodiment, the green half of the first variety is mixed 
with the red half of the second variety to form a second test sample. The first test 
sample is applied to a prepared microarray and the microarray is incubated for a set 
period and temperature wherein the cDNA of the first test sample is allowed to 
competitively hybridize with the genes printed on the microarray. After incubation, 
the microarray is washed to remove the unhybridized cDNA. The washed 
microarray is illuminated by light that causes the red and green tags to emit 
fluorescent light. The microarray is scanned wherein the intensities of the 
fluorescent light emitted by the red and green dyes are measured and recorded for
each spot on the microarray. The intensities of the fluorescent dye signals depend,
in part, on the abundance of the corresponding mRNA in the sample. 

In a preferred embodiment, the second test sample is applied to a second 
microarray prepared identically to the first microarray and allowed to competitively 
hybridize with the genes printed on the second microarray. After incubation, the
second microarray is washed to remove the unhybridized cDNA and illuminated by 
light causing the red and green tags to emit fluorescent light. The second 
microarray is scanned and the intensities of the fluorescent light emitted by the red 
and green dyes are measured and recorded for each spot on the second microarray. 
If the experiment has more than two varieties, the tagged halves of each
variety may be combined with other varieties in a loop design as described in Kerr et
al., "Experimental Design for Gene Expression Microarrays", [online], (2000) The
Jackson Laboratory, [retrieved on 2001-05-01], retrieved from the Internet: <URL:
http://www.jax.org/research/churchill/research/expression/kerr-design.pdf>, herein
incorporated by reference in its entirety. In a loop design, each variety is labeled
once with the red dye and once with the green dye. In such a design, the varieties
are said to be balanced with respect to dyes, which means that dye effects are
unconfounded with variety effects. Each solution is applied to an identically
prepared microarray. The microarrays are then incubated, washed, and scanned as
described for the two-variety experiment.
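
By way of illustration only, the dye balance of a loop design may be checked with a short Python sketch. The function names `loop_design` and `dye_counts`, and the pairing rule that places variety k in red and variety k+1 in green on array k, are assumptions made for this example and are not taken from the cited reference.

```python
def loop_design(v):
    """Return (array, red_variety, green_variety) tuples for v varieties.

    Assumed pairing rule: array k carries variety k labeled red and
    variety (k + 1) mod v labeled green, which closes the loop.
    """
    return [(k, k, (k + 1) % v) for k in range(v)]


def dye_counts(design, v):
    """Count how many times each variety is labeled red and green."""
    red = [0] * v
    green = [0] * v
    for _, red_variety, green_variety in design:
        red[red_variety] += 1
        green[green_variety] += 1
    return red, green


design = loop_design(4)
red, green = dye_counts(design, 4)
print(design)
print(red, green)
```

Under this assumed pairing rule every variety appears exactly once with each dye, which is the balance property described above.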

Early experiments with microarrays calculated the ratio between the red and 
green dye intensities for each spot on the microarray. If both targets hybridized to 
the probe at equal rates, the red and green intensities for the spot would be roughly 
equal and the R/G ratio would be about one. If, on the other hand, the red target 
hybridized at a much higher rate than the green target to the probe, the measured
red intensity signal would be larger than the green intensity signal and the R/G ratio
would be larger than one. Conversely, if the green target hybridized at a higher rate
than the red target to the probe, the R/G ratio would be a small fraction of one.
Microarray experiments, however, contain a large amount of variability due to the
methods used for preparing and purifying the gene and cDNA samples, spotting the
polynucleotides on the microarray, scanning the washed microarray after incubation,
and the variability that arises from the inherent complexity of biological systems.
The early experiments with microarrays addressed the variability issue by setting an
arbitrary threshold level for the intensity ratios. In the early experiments, for 
example, only ratios exceeding two to three times the average of all the intensity 
ratios in the experiment were considered significant. 
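
The threshold-ratio approach described above can be sketched as follows. This is a minimal illustration only: the function name `flag_by_ratio`, the default factor of 2, and the four intensity values are hypothetical and are not data from any experiment.

```python
import numpy as np


def flag_by_ratio(red, green, factor=2.0):
    """Flag spots whose R/G ratio exceeds `factor` times the mean ratio."""
    ratios = np.asarray(red, dtype=float) / np.asarray(green, dtype=float)
    threshold = factor * ratios.mean()
    return ratios, ratios > threshold


# Hypothetical red and green intensities for four spots.
red = [100.0, 220.0, 90.0, 1200.0]
green = [100.0, 200.0, 100.0, 100.0]
ratios, flagged = flag_by_ratio(red, green)
print(ratios, flagged)
```

Only the spot with a very large ratio is flagged, while the spots with modest ratios of 1.1 and 0.9 are ignored, which illustrates why small but reproducible changes escape this technique.
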
Simple ratios are adequate for identifying genes with large changes in
expression but cannot detect small changes in expression. In order to identify genes
with small, but reproducible, changes in expression, statistical methods must be 
used to analyze the microarray data. 

Dudoit, et al., "Statistical methods for identifying differentially expressed
genes in replicated cDNA microarray experiments", Technical report # 578,
Department of Statistics, University of California, Berkeley, [online], 2000 [retrieved
on 2001-05-15], retrieved from the Internet: <URL:
http://www.stat.berkeley.edu/users/terry/zarray/Html/matt.>,
discloses statistical methods for identifying differentially expressed genes using a
t-statistic with modified p-values on the log2(R/G) intensity ratios for each gene in the
experiment. The intensity ratios are first normalized to remove the identified
systematic variation in the microarray experiments as described in Yang, et al.,
"Normalization for cDNA Microarray Data", Technical report # 589, Department of
Statistics, University of California, Berkeley, [online], 2001 [retrieved on 2001-05-15],
retrieved from the Internet: <URL:
http://www.stat.berkeley.edu/users/terry/zarray/TechReport/589.pdf>.

Analysis of Variance (ANOVA) is another method of analyzing microarray
datasets. ANOVA is generally described in Montgomery, D. C., Design and analysis
of experiments, NY, John Wiley & Sons, 1991, pp. 1-515, QA279.M66. A
description of ANOVA applied to microarray datasets is given in Kerr, et al.,
"Analysis of Variance for Gene Expression Microarray Data", Journal of
Computational Biology 7, 819-837 (2000), herein incorporated by reference.

ANOVA allocates the variation in an experiment to multiple sources and is
capable of discerning smaller effects that the threshold ratio technique cannot
handle. One step in the ANOVA procedure requires the calculation of the inverse of
a matrix of size q where q is the number of parameters (also referred to as the
dimensionality of the problem) in the microarray experiment. In a typical microarray
experiment, the matrix size, q, may be several thousand and require significant
computational resources in order to determine the inverse matrix. Therefore, there
exists a need to efficiently perform the ANOVA by reducing the dimensionality of the
matrix.


Summary of the Invention 
An efficient method for the analysis of variance of gene expression microarray
datasets is disclosed for experimental designs wherein the gene factor (G) is orthogonal
to the other factors in the experiment (A, D, and V). The orthogonality of the G factor to
the other factors removes the gene-specific terms from the least-squares normal
equations for the non-G factors, thereby allowing a sequential solution, first
estimating the main effects and then estimating the gene-specific effects, that
does not require the inversion of a large matrix of size about n(a+v), where a is the
number of microarrays in the experiment, v is the number of varieties in the
experiment, and n is the number of genes in the experiment. The main effects are
estimated first, which requires the inversion of a matrix of size about (a+v-1) <<
n(a+v). The main effects are then used to estimate two-factor interaction effects for
each gene, which requires inverting a matrix of size (a+v-2) only once.
In one embodiment of the present invention, a method is disclosed for estimating the effects
of a plurality of factors and at least one of a plurality of interactions between the
factors in a gene expression microarray experiment generating a microarray dataset
wherein the factors include a gene factor and a variety factor and the interactions
include a variety-gene interaction, the gene factor being orthogonal to the other
factors, the method comprising the steps of estimating the factor effects based on a
plurality of averages of the microarray dataset and estimating the interaction effects
based on a plurality of averages of the microarray dataset and on the estimated
factor effects. Furthermore, estimating the factor effects includes inverting a square
matrix of size p wherein p is equal to the sum of the number of levels for each
non-gene factor minus the number of non-gene factors. In addition, the interaction
effects are estimated by inverting a second square matrix of size p' wherein p' is p-1.






In another embodiment of the present invention, a method is disclosed for 
estimating at least one gene-variety interaction in a gene expression microarray 
experiment having an experimental design characterized by a number of degrees of 
freedom, q, and defined by a gene factor, a plurality of non-gene factors, a plurality 
of two-factor interactions wherein a full replication of genes is present for every
combination of the plurality of non-gene factors, the method comprising the steps of:
inverting a first square matrix characterized by a size, p, wherein p < q; estimating at
least one of a plurality of non-gene factor effects from the first square matrix inverse;
constructing a second square matrix based in part on the estimated non-gene factor,
the second square matrix characterized by size, p', wherein p' < q; inverting the second
square matrix; and estimating at least one gene-variety interaction from the inverted
second square matrix.

In another embodiment of the present invention, a method is disclosed for
estimating at least one gene-variety interaction in a gene expression microarray
experiment generating a dataset and having a design characterized by a arrays, v
varieties, n genes, and d dyes wherein a full replication of genes is present for every
combination of arrays, varieties and dyes, the method comprising the steps of:
constructing a global data vector, d, based on a plurality of averages of the dataset;
constructing a square matrix, T, characterized by a size, p, wherein p = a+v+d-3;
inverting the square matrix, T; estimating the global effects, τ, wherein τ = T^-1 d;
constructing a square matrix, T_g, characterized by a size, p', wherein p' = p-1;
constructing a gene-specific data vector, d_g, based on a plurality of averages of the
dataset; inverting the square matrix, T_g; and estimating the gene-variety interaction,
τ_g, wherein τ_g = T_g^-1 d_g.

In another embodiment of the present invention, a system is disclosed for
estimating the effects of a plurality of factors and at least one of a plurality of
interactions between the factors in a gene expression microarray experiment
generating a microarray dataset wherein the factors include a gene factor and a
variety factor and the interactions include a variety-gene interaction, the gene factor
being orthogonal to the other factors, the system comprising: a processor; a memory
in signal communication with the processor; and a program stored in the memory,
the program capable of being executed by the processor, the program including the
steps of estimating the main effects based on a plurality of averages of the
microarray dataset; and estimating the interaction effects based on a plurality of
averages of the microarray dataset and on the estimated factor effects.

In another embodiment of the present invention, a method is disclosed for 
estimating the effects of a plurality of factors and at least one of a plurality of
interactions between the factors in a gene expression microarray experiment
generating a microarray dataset wherein the factors include a gene factor and a
plurality of non-gene factors and the interactions include at least one of a
gene-non-gene interaction, the gene factor being orthogonal to the non-gene factors, the
method comprising the steps of: constructing a first data model including only
non-gene factors and non-gene interactions; estimating the effects of the non-gene
factors and non-gene interactions based on the first data model and on a plurality of
averages of the microarray dataset; creating a transformed dataset from the
microarray dataset and the estimated factor and interaction effects; constructing a
second data model including the gene factors and the gene interactions; and
estimating the gene-non-gene interaction effects based on the second data model
and a plurality of averages of the transformed dataset.

Brief Description of the Drawings

The present invention may be understood more fully by reference to the
following detailed description of the preferred embodiment of the present invention,
illustrative examples of specific embodiments of the invention and the appended
figures in which:

Fig. 1 is a flowchart illustrating the method of a preferred embodiment of the
present invention.

Fig. 2 is a block diagram of a preferred embodiment of the present invention.

Detailed Description of the Preferred Embodiments 
A typical microarray experimental design is described below in order to define 
the variables used in a preferred embodiment of the present invention. It will 
become apparent to those skilled in the art that the present invention is not limited to 
the particular design described below.

The goal of many experiments is to determine the effect of one or more
independent variables on one or more dependent variables. Independent variables
are controlled by the experimenter and the dependent variables are the quantities
measured by the experimenter. In a microarray experiment, the single dependent
variable is the measured fluorescent light intensity emitted by each dye on each
spot in the microarray. The independent variables in the microarray experiment are
the probes (genes) and targets (varieties).

In addition to independent variables, "environmental" variables may affect the
measured response of the dependent variables to such an extent that the
experimenter must consider such environmental variables during both the design of
the experiment and during the analysis of the experiment's dataset. One example of
such a variable is the variation caused by slight differences between each
microarray when more than one microarray is used in an experiment. Although
every effort is made to produce identical microarrays, even slight variations in
spotting, for example, may result in a response bias that could mask the effects the
experimenter is ultimately interested in determining. Another example is the
different quantum efficiencies of the dyes used in the experiment. The dye with a
higher quantum efficiency will tend to emit more fluorescent light than the dye with
the lower quantum efficiency. Since the scanner cannot distinguish the fluorescent
light emitted by a "brighter" dye from the fluorescent light emitted by a strongly
hybridized target, unless the systematic bias introduced by the dyes is compensated
or cancelled, the experimenter will not be able to distinguish the hybridization effects
from the dye effects, especially when the two effects are roughly of the same
strength.

In a preferred embodiment, the two independent variables, variety ("V") and
gene ("G"), and two environmental variables, array ("A") and dye ("D"), are selected
as the factors of the experiment design. Each factor has a set number of levels. For
example, the dye factor, D, has two levels designated "red" and "green" in the
preferred embodiment. The number of levels for the array factor, A, is equal to the
number of microarrays used in the experiment and is designated by "a". Similarly,
the number of levels for the variety factor, V, is equal to the number of targets used
in the experiment and is designated by "v". The number of levels for the gene factor,
G, is equal to the number of distinct probes used in the experiment and is
designated by "n". The effects of each of the factors on the measured response are
called the main effects.

In addition to the four main effects, there are six 2-factor interactions, four
3-factor interactions, and one 4-factor interaction. Not all interactions are expected to
be significant and the experimenter selects the interactions considered based on
experience. In a preferred embodiment, the variety x gene ("VG") interaction and
the array x gene ("AG") interaction are selected for the data model. The VG
interaction accounts for the effect of variety-gene pairs on the measured fluorescent
light intensity. A large effect of a specific variety-gene pair indicates gene
expression. The AG interaction accounts for the effect of array-gene pairs on the
measured fluorescent light intensity.

In order to allocate the variation in the measured dataset to identified sources
of variation, a data model is constructed from factors of interest and from known or
suspected sources of variation. For purposes of illustration, a four-factor linear data
model including the four main effects and two two-factor interactions is chosen
having the following form:

$$y_{ijkgs} = \mu + A_i + D_j + V_k + G_g + (AG)_{igs} + (VG)_{kg} + \varepsilon_{ijkgs} \qquad (1)$$

where y_ijkgs is the measured fluorescent light intensity from the s-th spot of the i-th
array, j-th dye, k-th variety, and g-th gene,

μ is the average of all measurements,

A_i is the effect of the i-th array,

D_j is the effect of the j-th dye,

V_k is the effect of the k-th variety,

G_g is the effect of the g-th gene,

(AG)_igs is the effect of the interaction between the i-th array and the g-th gene,

(VG)_kg is the effect of the interaction between the k-th variety and the g-th
gene, and

ε_ijkgs is the mean zero independent error term of the model.
The main effect A_i accounts for the variation that each individual array sees
during fabrication and during the experiment that contributes to the variation in the
fluorescent signal from array to array. It accounts for variation that may occur when
arrays are probed under inconsistent conditions that increase or reduce hybridization
efficiencies of the labeled cDNA. The index, i, ranges from 1 to a where a is equal
to the number of microarrays used in the experiment. For example, when two
microarrays are used in the experiment, a = 2, which is also equal to the number of
A-factor levels. The number of degrees of freedom ("dof") for the A factor is equal to (a-1).

The dye main effect, D_j, measures the differences between the two fluorescent dye
labels. An example of such a difference may occur because one dye is consistently
"brighter" than the other dye. The index, j, ranges from 1 to d where d is equal to 
the number of dyes used in the experiment. In a typical gene expression microarray 
experiment, two dyes are used, red and green, so d = 2 and the D-factor levels = 2. 
The dof for the D factor is equal to (d-1). 

The variety main effect, V_k, accounts for the variation that arises when
specific varieties have higher or lower expression levels for all the genes spotted on
the arrays. The index, k, ranges from 1 to v where v is equal to the number of
varieties used in the experiment. The V-factor levels are also equal to v and the dof
for the V factor is equal to (v-1).

The gene main effect, G_g, accounts for the variation that arises when certain
genes emit a higher or lower fluorescent signal overall compared to other genes.
This may occur because some genes have generally higher or lower levels of
expression than other genes or may occur because of the different hybridization
efficiencies and different labeling efficiencies for the different genes. The index, g,
ranges from 1 to n where n is equal to the number of genes used in the experiment.
The G-factor levels are also equal to n and the dof for the G factor is equal to (n-1).

The 2-factor array-gene (AG) interaction accounts for the variation between 
array-gene pairs. The AG interaction, or "spot effects", arises when the spots for a 
given gene on the different arrays vary in the amount of cDNA available for 
hybridization. Since each gene may be replicated on the same microarray, the 
index, s, ranges from 1 to t where t is equal to the number of times each gene is 
spotted on the same array. If each gene is spotted only once on each array, t = 1 and s = 1.

The 2-factor variety-gene (VG) interaction accounts for the variation between 
variety-gene pairs and is the information the experiment seeks to resolve.

A complete experiment will involve a arrays, v varieties, and n genes spotted t
times on each array. If all possible ijkgs combinations are run, the experiment is
called a factorial design. The number of free parameters depends on the data
model selected. For the data model described by equation (1), the number of free
parameters in the model, q, is (n-1)(v+(a-1)t)+a+v. The numbers of arrays and
varieties are typically less than one hundred but the number of genes may be in the
thousands or tens of thousands.
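
To give a sense of scale, the parameter count q can be evaluated for an example design and set beside the reduced matrix sizes p = a+v+d-3 and p' = p-1 stated in the Summary. The sketch below is illustrative only: the function names are hypothetical, and the design of 5000 genes, 10 arrays, and 10 varieties is an assumed example rather than a design from this disclosure.

```python
def full_model_size(n, a, v, t=1):
    """Number of free parameters q = (n-1)(v + (a-1)t) + a + v."""
    return (n - 1) * (v + (a - 1) * t) + a + v


def reduced_sizes(a, v, d=2):
    """Sizes p = a+v+d-3 (global matrix) and p' = p-1 (gene-specific matrix)."""
    p = a + v + d - 3
    return p, p - 1


# Assumed example design: 5000 genes, 10 arrays, 10 varieties, one spot per gene.
q = full_model_size(n=5000, a=10, v=10)
p, p_gene = reduced_sizes(a=10, v=10)
print(q, p, p_gene)
```

For this assumed design, q is 95001 while the two reduced matrices have sizes 19 and 18.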

The method of fitting the dataset to the linear model of equation (1) is called
linear regression and requires the construction of a design matrix, X, and inverting
the q x q matrix X^T X. Standard statistical software packages, however, are usually
not able to invert the X^T X square matrix for the size usually encountered in a cDNA
microarray experiment.
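
The standard regression step can be sketched on a toy problem. The example below is a minimal illustration, assuming a hypothetical three-column design matrix and made-up coefficients; it is not the design matrix of equation (1). It forms X^T X, inverts it, and compares the result with NumPy's `numpy.linalg.lstsq` solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 6 x 3 design matrix: intercept plus two effect-coded columns.
X = np.column_stack([
    np.ones(6),
    [1, 1, 0, 0, -1, -1],
    [1, -1, 1, -1, 1, -1],
]).astype(float)
beta_true = np.array([10.0, 2.0, -0.5])            # made-up coefficients
y = X @ beta_true + 0.01 * rng.standard_normal(6)  # small simulated noise

XtX = X.T @ X                                  # the q x q matrix (q = 3 here)
beta_inv = np.linalg.inv(XtX) @ X.T @ y        # normal-equations solution
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_inv, beta_lstsq)
```

The cost of forming and inverting X^T X grows rapidly with q, which is why the explicit inversion becomes impractical when q reaches the tens of thousands.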

The inventors have discovered a novel method that avoids the necessity of 
inverting a matrix of size q for a certain class of cDNA microarray experimental 
designs. The microarrays used in the typical cDNA experiment are prepared by a 
mechanical robot that is programmed to repeatedly print an array in a certain way. 
Therefore, an assumption may be made that the same set of genes is spotted on 
each microarray in an experiment. This means that a full replication of genes is 
present for every array, dye, and variety combination in any experimental design. 
When such a condition exists, the gene effects are said to be orthogonal to all 
effects of the array, dye and variety factors. The orthogonality of the gene factor to 
the other factors effectively separates the effects into two groups: "global" or
"non-gene" effects, which involve A, D, and V, and gene-specific effects, such as the VG
interaction, which involve G. The separation into global effects and gene-specific
effects reduces the size of the matrix inversion by three to four or more orders of
magnitude relative to the standard ANOVA methods.

The least-squares estimators, by definition, minimize the residual sum of
squares ("RSS") given by:

$$RSS = \sum \left[ y_{ijkgs} - \mu - A_i - D_j - V_k - G_g - (AG)_{igs} - (VG)_{kg} \right]^2 \qquad (2)$$

where the sum is taken over all indices. Setting the partial derivatives of the RSS with
respect to each of the parameters to zero gives the following set of linear equations.

In the equations above, the "." as an index indicates an average over that index. For
example, y_..... is the average of all the fluorescent intensity measurements in the
experiment. Similarly, y_1.... is the average of all the fluorescent intensity
measurements made on array 1. The "~" over a variable indicates the least-squares
estimate for that variable. The notation k ∈ i means the varieties k appearing on array
i such that if variety k appears on array i in both red and green channels, then it
should appear twice in the summation. Similarly, i ∋ k indicates a summation over
the arrays i containing variety k such that if variety k appears on array i in both red
and green channels, then it should appear twice in the summation. The equations
above also incorporate the "zero-sum" constraints wherein

$$\sum_i A_i = \sum_j D_j = \sum_k r_k V_k = \sum_g G_g = \sum_g (VG)_{kg} = \sum_k r_k (VG)_{kg} = \sum_{i,s} (AG)_{igs} = \sum_{g,s} (AG)_{igs} = 0$$

and r_kj is the number of times variety k appears in the design labeled with dye j, r_k
is the total replication of variety k given by r_k = Σ_{j=1,2} r_kj, and r = Σ_k r_k.
Equations (4), (5), and (6) define a linear transformation, Tτ = d, where

$$d^T = \{ y_{1\cdot\cdot\cdot\cdot}, \ldots, y_{a-1\cdot\cdot\cdot\cdot}, y_{\cdot\cdot 1\cdot\cdot}, \ldots, y_{\cdot\cdot v-1\cdot\cdot}, y_{\cdot 1\cdot\cdot\cdot} \} - y_{\cdot\cdot\cdot\cdot\cdot} \qquad (8)$$

$$\tau^T = \{ A_1, \ldots, A_{a-1}, V_1, \ldots, V_{v-1}, D_1 \} \qquad (9)$$

where T is of size p = a+v-1, which is much less than q. Since p is on the order of
about 100, the matrix T may be inverted with commonly available matrix inversion
algorithms. Alternatively, the system of equations (4) - (6) may be solved directly
by QR decomposition.
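
Both solution routes can be sketched in a few lines of Python. This is an illustration only: the matrix T below is an arbitrary invertible placeholder, not the matrix defined by equations (4) - (6), and the helper name `solve_qr` is hypothetical.

```python
import numpy as np


def solve_qr(T, d):
    """Solve T tau = d through the factorization T = Q R (R upper triangular)."""
    Q, R = np.linalg.qr(T)
    return np.linalg.solve(R, Q.T @ d)


# Placeholder 5 x 5 system standing in for the global system T tau = d.
T = 4.0 * np.eye(5) + np.ones((5, 5))
tau_true = np.arange(1.0, 6.0)
d = T @ tau_true

tau_inverse = np.linalg.inv(T) @ d   # explicit inversion
tau_qr = solve_qr(T, d)              # QR route, no explicit inverse
print(tau_inverse, tau_qr)
```

The two routes agree to floating-point precision; the QR route avoids forming the inverse explicitly, which is generally the more numerically stable choice.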

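The two-factor least-squares estimators are given by equations (10) and (11)
below.

$$y_{\cdot\cdot kg\cdot} - y_{\cdot\cdot k\cdot\cdot} - y_{\cdot\cdot\cdot g\cdot} + y_{\cdot\cdot\cdot\cdot\cdot} = \frac{1}{r_k} \sum_{i \ni k} \widetilde{(AG)}_{ig\cdot} + \widetilde{(VG)}_{kg} \qquad (10)$$

$$y_{i\cdot\cdot g\cdot} - y_{i\cdot\cdot\cdot\cdot} - y_{\cdot\cdot\cdot g\cdot} + y_{\cdot\cdot\cdot\cdot\cdot} = \widetilde{(AG)}_{ig\cdot} + \frac{1}{2} \sum_{k \in i} \widetilde{(VG)}_{kg} \qquad (11)$$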

Equations (10) and (11) define a linear transformation of the form T_g τ_g = d_g, where

$$d_g^T = \{ y_{1\cdot\cdot g\cdot} - y_{1\cdot\cdot\cdot\cdot}, \ldots, y_{a-1\cdot\cdot g\cdot} - y_{a-1\cdot\cdot\cdot\cdot}, y_{\cdot\cdot 1g\cdot} - y_{\cdot\cdot 1\cdot\cdot}, \ldots, y_{\cdot\cdot v-1,g\cdot} - y_{\cdot\cdot v-1\cdot\cdot} \} - y_{\cdot\cdot\cdot g\cdot} + y_{\cdot\cdot\cdot\cdot\cdot} \qquad (12)$$

$$\tau_g^T = \{ (AG)_{1g\cdot}, \ldots, (AG)_{a-1,g\cdot}, (VG)_{1g}, \ldots, (VG)_{v-1,g} \} \qquad (13)$$

and T_g is a square matrix of size a+v-2. T_g may be inverted with commonly available
matrix inversion algorithms. More importantly, T_g is the same for every g and can be
constructed from T. Alternatively, the system of equations (10) - (11) may be
solved directly by QR decomposition to obtain the two-factor estimates.
Fig. 1 shows a flowchart illustrating a computer-implemented program of a
preferred embodiment of the present invention. After the microarrays have been
scanned and the fluorescent intensities measured and stored, the data vector, d, for
the global factors is constructed in 110. The square matrix T is constructed and
inverted in 120 using commonly available matrix inversion algorithms. The effects of
the global factors are estimated in 130 by τ = T^-1 d. The gene-specific matrix, T_g, is
constructed and inverted in 140. For each gene in the experiment, the program first
constructs the gene-specific data vector for the g-th gene in 150 and estimates the
two-factor interaction effects for the g-th gene in 160. Steps 150 and 160 are
repeated for each gene until all the gene-specific interactions have been estimated.
Finally, the program calculates the ANOVA table and residuals in 170 based on the
estimates using standard techniques known to one of ordinary skill in the statistical
art.
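
The control flow of Fig. 1 can be sketched as follows. This is a simplified illustration and not the program of the preferred embodiment: the function name `estimate_effects` is hypothetical, the matrices and data vectors are assumed to have been constructed already (steps 110 and 150 are not shown), and the numerical values are placeholders chosen only to exercise the loop.

```python
import numpy as np


def estimate_effects(T, d, T_g, d_genes):
    """Sketch of steps 120-160: one global solve, then one product per gene.

    d_genes holds one gene-specific data vector per row.
    """
    tau = np.linalg.inv(T) @ d                  # steps 120 and 130
    T_g_inv = np.linalg.inv(T_g)                # step 140, performed only once
    d_genes = np.asarray(d_genes, dtype=float)
    tau_genes = np.empty_like(d_genes)
    for g, d_g in enumerate(d_genes):           # steps 150 and 160, per gene
        tau_genes[g] = T_g_inv @ d_g
    return tau, tau_genes


# Placeholder inputs: p = 3 global effects, p' = 2 gene effects, 2 genes.
T = 2.0 * np.eye(3)
d = np.array([2.0, 4.0, 6.0])
T_g = np.array([[2.0, 0.0], [0.0, 4.0]])
d_genes = np.array([[2.0, 4.0], [4.0, 8.0]])
tau, tau_genes = estimate_effects(T, d, T_g, d_genes)
print(tau, tau_genes)
```

Because T_g is the same for every gene, the per-gene work reduces to a single matrix-vector product, so the total cost grows only linearly with the number of genes.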

Fig. 2 shows a block diagram of another embodiment of the present invention.
A bus 210 is connected to a processor 220 and provides signal communication
between the processor 220 and memory 230, user interface 240, and storage 250.
The user interface 240 allows the processor to display, print or send information to
the user and receive data input from the user or another external source. Storage
250 is capable of permanently storing data and programs executable by the
processor that may be used by the processor 220. Data and programs may be
transferred to memory 230 for faster access by the processor 220. In a preferred
embodiment of the present invention, a program embodiment of the flowchart of Fig.
1 is stored in storage 250. A user may command the processor 220 to execute the
program via the user interface 240. The processor 220 may also receive a dataset
via the user interface 240 or the dataset may have been previously stored in storage
250. The processor 220 executes the program and uses the dataset to calculate the
ANOVA tables and residuals. After the processor 220 has calculated the tables and
residuals, the processor 220 may either display or print the tables and residuals for
the user's review or may store the tables and residuals in storage 250.
It should be apparent to one of ordinary skill in the statistical modeling art that
the present invention is not limited by the choice of the data model described in
equation (1). For example, another embodiment of the present invention includes a
data model that replaces the variety effect with the array-dye interaction. In addition,
the dye-gene interaction may also be added to the data model.

In another embodiment of the present invention, two data models are used to
analyze the dataset. The first data model includes only non-gene factors and
interactions. An example of such a data model is given by the equation below.

$$y_{ijkgs} = \mu + A_i + D_j + (AD)_{ij} + \varepsilon_{ijkgs} \qquad (14)$$

The A, D, and AD effects may be estimated independently of the gene-specific factors
because of the orthogonality of the gene factor to the other non-gene factors. Using
the estimates of the A, D, and AD effects obtained using the data model of equation
(14), the dataset is transformed using the equation below.

$$x_{ijkgs} = y_{ijkgs} - \tilde{\mu} - \tilde{A}_i - \tilde{D}_j - \widetilde{(AD)}_{ij} \qquad (15)$$

A second data model including only the gene-specific factors is constructed for the
transformed dataset having the form shown in the equation below.

$$x_{ijkgs} = G_g + (AG)_{igs} + (DG)_{jg} + (VG)_{kg} \qquad (16)$$

The gene-specific effects may be estimated gene by gene using the second data model
and therefore do not require simultaneously solving for all gene effects at
once.
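
The two-model procedure can be illustrated on synthetic data. The sketch below rests on several assumptions made only for this example: the data are noise-free and simulated, every gene appears once on every array-dye combination, the VG term is omitted for brevity, and all names (`center`, `cell`, and the `_hat` arrays) are hypothetical. Under the zero-sum constraints and this balanced layout, simple averages recover every effect.

```python
import numpy as np


def center(x, axes):
    """Subtract the mean along each listed axis to impose zero-sum constraints."""
    for axis in axes:
        x = x - x.mean(axis=axis, keepdims=True)
    return x


rng = np.random.default_rng(1)
a, d, n = 4, 2, 50                       # arrays, dyes, genes (assumed sizes)
mu = 8.0
A = center(rng.standard_normal(a), [0])
D = center(rng.standard_normal(d), [0])
AD = center(rng.standard_normal((a, d)), [0, 1])
G = center(rng.standard_normal(n), [0])
AG = center(rng.standard_normal((a, n)), [0, 1])
DG = center(rng.standard_normal((d, n)), [0, 1])

# Noise-free synthetic responses y[i, j, g].
y = (mu + A[:, None, None] + D[None, :, None] + AD[:, :, None]
     + G[None, None, :] + AG[:, None, :] + DG[None, :, :])

# First data model: non-gene effects from averages over genes.
cell = y.mean(axis=2)                    # array-by-dye averages
mu_hat = cell.mean()
A_hat = cell.mean(axis=1) - mu_hat
D_hat = cell.mean(axis=0) - mu_hat
AD_hat = cell - mu_hat - A_hat[:, None] - D_hat[None, :]

# Transformed dataset: subtract the fitted non-gene effects.
x = y - cell[:, :, None]

# Second data model: gene-specific effects, each gene handled separately.
G_hat = x.mean(axis=(0, 1))
AG_hat = x.mean(axis=1) - G_hat[None, :]
DG_hat = x.mean(axis=0) - G_hat[None, :]
print(np.allclose(A_hat, A), np.allclose(AG_hat, AG))
```

In this balanced, noise-free setting no matrix larger than the array-by-dye table of averages is ever formed, and the estimates for one gene never depend on the data of another gene.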

The present invention is not to be limited in scope by the specific
embodiments described herein. Indeed, modifications of the invention in addition to
those described herein will become apparent to those skilled in the art from the
foregoing description and accompanying figures. Doubtless, numerous other
embodiments can be conceived that would not depart from the teaching of the
present invention, whose scope is defined by the following claims.